A Mixture of Generalized Hyperbolic Distributions 



Ryan P. Browne* and Paul D. McNicholas 
Department of Mathematics & Statistics, University of Guelph. 



Abstract 

We introduce a mixture of generalized hyperbolic distributions as an alternative to 
the ubiquitous mixture of Gaussian distributions as well as their near relatives of which 
the mixture of multivariate t and skew-i distributions are predominant. The mathe- 
matical development of our mixture of generalized hyperbolic distributions model re- 
lies on its relationship with the generalized inverse Gaussian distribution. The latter 
is reviewed before our mixture models are presented along with details of the afore- 
said reliance. Parameter estimation is outlined within the expectation-maximization 
framework before the performance of our mixture models is illustrated in clustering 
applications on simulated and real data. In particular, the ability of our models to 
recover parameters for data from underlying Gaussian, t-, and skew-t distributions is 
demonstrated. Finally, the role of these models as a superclass as well as the antici- 
pated impact of these models on the model-based clustering, classification, and density 
estimation literature is discussed with special focus on the role of Gaussian mixtures. 

1 Introduction 

Finite mixture models are based on the underlying assumption that a population is a convex 
combination of a finite number of densities. They therefore lend themselves quite natu- 
rally to classification and clustering problems. Formally, a random vector X arises from a 
parametric finite mixture distribution if, for all x C X, its density can be written /(x | 
•&) = Yl^i^gfgi*- I Og)i where n g > such that Yl g =i ^g = 1 are the mixing proportions, 
/i(x I g ), . . . , /g(x I Og) are called component densities, and $ = (77, Oq) is the vec- 

tor of parameters with 7T = (iti, . . . , tvq). The component densities /i(x | 0\), . . . , /g( x I &g) 
are usually taken to be of the same type, most commonly multivariate Gaussian. The 
popularity of the multivariate Gaussian distribution is due to its mathematical tractability 
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and flexibility for density estimation. In the event that the component densities are mul- 



*) = £ 



i 7 ^ 



tivariate Gaussian, the density of the mixture model is /(x 
where 0(x | n g , S 5 ) „ 

trix Hg. The idiom 'model-based cluster ing' is used to connote cluster i ng us ing mixture 



is the multivariate Gaussian density with mean /x and covariance ma- 



models. Model- based classif i cation (e.g., iDean et al.l . 120061 ; iMcNicholasl . |2010| ). or partial 



classification (cf. McLachlanl . Il992l . Section 2.7), can be regarded as a semi-supervised ver- 
sion of model-based c l usteri ng, while model-based discriminant analysis is supervised (cf. 



Hastie and Tibshiranil . Il996l ). 



The recent burgeoning of non-Gaus sian approaches to model-b ased clustering includes 
work o n the mult ivariate t-distribution (iPeel and McLachlanl. 120001). the skew-normal distri- 



butio n ( iLinl . 120091 ) , the skew-t distrib ution (ILinl.l2010l;lLee and McLachlanl. 1201 luVrbik and McNicholasl . 



20121). as well as oth er approaches (IKarlis and Meligkotsidoul . 120071 : lHandcock et al. 



2007; 



Browne et al.l . |2012|). In this paper, we add to the richness of the pallet of non-Gaussian 



mixture model-based approaches to clustering and classification by introducing a mixture of 
generalized hyperbolic distributions, which is a sort of superclass containing the aforesaid 
models as special or limiting cases (cf. Section [5]). 

In Section 121 our methodology is developed drawing on connections with the generalized 
inverse Gaussian distribution. Parameter estimation is described (Section [3]) before both 
simulated and real data analyses are used to illustrate our approach (Section H]). The paper 
concludes with a summary and suggestions for future work in Section |5j 



2 Methodology 



2.1 Generalized Inverse Gaussian Distribution 



The generalized inverse Gaussian (GIG) d istribution was introduced bv iGoodl f 



(1978 


) 


Halereen ( 


1979 


), and 


J0reensenl (1982) 



953ft and 



/rite Y ^ GIG(^, x, A) to indicate that the 
random variable Y follows a generalized inverse Gaussian (GIG) distribution with parameters 
(ip, x, A) and density 



p(y I A) 



A/2 X- 

y 



exp 



ipy + x/y \ 
2 J 



2K X {V^hl) 

for y > 0, where ip, % G M + , A Gl, and Kx is the modified Bessel function of the third kind 
with index A. There are several special cases of the GIG distribution, such as the gamma 
distribution (x = 0, A > 0) and the inverse Gaussian distribution (A = —1/2) 

oj/t] or uj - 



Setting x — an d ip = w/r] or u = y/ipx an d V = \fxN ) i we obtain a different but for 
our purposes, more meaningful parameterization of the GIG density, 



h(y I ^,7), A) 
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where rj > is a scale parameter, oj > is a concentration parameter and A is an index 
parameter. Herein, we write Y ^ I p (co, t], A) to indicate that a p-dimensional random variable 
Y has the GIG density as parameterized in ([2]). The GIG distribution has some attractive 
properties including the tractability of the following expected values: 



E[Y 



v- 



K 



A+l M 

K x (u) 



E[l/Y] = l - Kx " l{uJ 



1 Kx+i (u) 2A 
d 

E [log Y] = log 7] + — log K x (w) = log 7] + 



(3) 
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<9v ™° " A v " y ~° '' A' A (cu) <% 
2.2 Generalized Hyperbolic Distribution 



McNeil et al.l (120051 ) give the density of a random variable X following the generalized hy- 
perbolic distribution, 



X + (5 (x, /x | A) 
+ ck'A^c* 



1 (A-p/2)/2 



X 



[^/x] A/2 ^a- p /2 (^ + Q'A- 1 a][x + i(x,HA)] 



(4) 



(27T) 



p/2i A ,l/2 



A'a (v 7 *?) exp {(/x -x)' A 1 a} 



where S (x, fx | A) = (x — /a) A -1 (x — /a) is the squared Mahalanobis distance between x 
and /a, and # = (A, x, ty. A*, A, a) is the vector of parameters. Herein, we use the notation 
X ^ Q p (A, Xi Mi A, tt ) to indicate that a p-dimensional random variable X has the gener- 
alized hyperbolic density in (jl]). Note that we use A to denote the covariance because, in 
this parameterization, we need to hold |A| = 1 to ensure idenitifiability (cf. Section [2~3|) . 

A generalized hyperbolic random variable X can be generated by combining a random 
variable Y ^ GIG^, x, A) and a latent multivariate Gaussian random variable U ^ A"(0, A) 
using the relationship 

X = /a + Ya + VFU, (5) 
and it follows that X | (Y — y) ^ A/" (/a + yen, yA). Therefore, from Bayes' theorem, 

/(x | 2/)%) 



/(x) 



^ + a' A' 1 a. 



_X + <K X ,M | A 
and so we have F | (X = x) ^ 



y \+ P /2-l exp {_ ^ ^ + a 'A-l a ) + (x + 5 (X, /A 


A))/y]/2} 


2A' A -p/2 


+ a'A^a] [x + 5(x,/a 


A)]J 





GIG(^ + a'A^cx, x + <5 (x, /a | A) , A - p/2). 
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McNeil et al.l ( 120051 ) describe a variety of limiting cases for the generalized hyperbolic 
distribution. For A = 1, we obtain the multivariate generalized hyperbolic distribution such 
that its univariate margins are one- dimensional hyperbolic distributions, for A = (p + l)/2, 
we obtain the p-dimensional hyperbolic distribution, and for A = —1/2, we obtain the inverse 
Gaussian distribution. If A > and \ — > 0, we obtain a limiting case of the distribution 
know n as generalized, Bessel function or variance-gamma distribution ( IB arndorff- Nielsen! 
19781). If A = 1 , i\) = 2 and x ~ * 0? then we obtain the asymmetric Laplace distribution 
(cf. iKotz et al.l. 120011) and if a = 0, we have the symmetric generalized hyperbolic distri 



bution ( Barndorff-Nielsenl 1978 ). Other special and limiting cases include the multivariate 
normal- inverse Gaussian (MNIG) distribution ( iKarlis and Meligkotsidoul . 120071 ). the skew-t 
distribution as well as the multivariate skew-normal, and Gaussian distributions. 



2.3 Identifiability and Re-Parameterizations 

Suppose we relax the condition that | A| = 1, in which case we use S to denote the covariance 
matrix. An identifiability issue arises because the density of Xi ^ G p (\,x/c,cip, l-i,cH,coc) 
and X 2 ^ G P (\, X? i'l Mi ^ a ) is identical for any c G M + . Using A, with | A| = 1, instead of 
S, solves this problem but would be prohibitively restrictive for model-based clustering and 
classification applications. An alternative approach is to use the relationship in (|5]) to set 
the scale parameter rj = 1. This relationship is equivalent to X = /x + Yrjct + y/YrjXJ = /x + 
Y/3+ \^YU, where (3 = r/a, Y ^ 1, A) and U ^ A/"(0, S). Under this parameterization, 
the density of the generalized hyperbolic distribution is 



/(x I 0) 



Co' 



(2tt) p/2 |S| 1/2 K x (u) exp { - (x - p)' ^ l (3) 



(6) 



and Y | (X = x) - GIG (a; + u + 5 (x, /x|0) , A - p/2). We use Q*(X,u, fi, S,/3) 

to denote the density in ([6]) and it is this parameterization that is used when we describe 
parameter estimation (Section EJ). 



3 Parameter Estimation 

Parameter estimation is carried out using an expectation-maximization (EM) algorithm 



(iDempster et al.l . Il977l ). The EM algorithm is an iterative technique that facilitates maxi- 
mum likelihood estimation when data are incomplete or treated as being incomplete. In our 
case, the missing data comprise the group memberships and the latent variable. We assume 
a clustering paradigm so that none of the group membership labels are known. Denote 
group memberships by z ig , where Zi g = 1 if observation % is in component g and Zi g = 
otherwise. The latent variables Y\, . . . , Y n are assumed to follow GIG distributions and the 



i 



complete-data log-likelihood is given by 



n G 

Z c (0 I x,y,z) = ^2^2 Zig 

i=l g=l 



V 



log vr 9 + log (xi | fig + yiag^yiT.g) +log/i(^ | tu g ,\ g ) 

^ n G n G 

i=l g=l i=l g=l 



9=1 i=l 



where C does not depend on the model parameters. 

In the E-step, the expected value of the complete-data log-likelihood is computed. Be- 
cause our model is from the exponential family, this is equivalent to replacing the sufficient 
statistics of the missing data by their expected values in Z c ($ | x, w, z), where the missing 
data are the latent variables and the group membership labels. These two sources of miss- 
ing data are independent and so we are only required to calculate the marginal conditional 
distribution for the latent variable and group memberships given the observed data. We 
require following expectations: 

7T(,/(Xj | Og) 

Ik [%ig | x iJ = — = : Zig, 



E [Wi I x,, Z ig = 1] = V^rv =■ a i9 , 

E[l/Wi|x i> ^ = l] = i%±iM-^=: b i9 , 

1 d 



E [log(Wi) | Xi, Z l9 = 1] = log?? + (w) =: c 



and we use the notation n 9 = XT=i ^9 = 0-/ n g) Y!i=i Zig a i, B g = (l/n g ) J2i=i Zigk, and 
C g = (1 /rig) Y!i=i ZigCi hereafter. 

In the M-step, we maximize the expected value of the complete-data log-likelihood to 
get the updates for the parameter estimates. The update for the mixing proportions is 
Tig = Y17=i Zig/ n - Updates for \i g and cx g are given by 

Si=l x « Zig(A g big — 1) '52i=i yi iZig{big — B g ) 



A = n r~7~r~7 -. \ and a 



Sj=l Zig(Agbig — 1) Z^i = l Zig(Agb ig ~ l) 



respectively. 

The update for S 9 is given by 



t g = — ^2 Zighg{*i ~ A 9 )( x i - A fl )' - « fl ( x 9 - A fl )' - ( x 9 - Aj + Ag&gi&gY, ( 7 ) 

n 9 i=l 



where x 9 = (l/n g ) ^" = i^i fl Xj. To demonstrate that S 5 is positive-definite, first note that 



< E 



for alH = 1, . . . , n, from Jensen's inequality. It follows that l/ai g < b ig and so 



E 



n 



E 



Now, by replacing A g with (1/n) Y^i=i(zig/bi g ), we obtain 
and the inequality 



>^igbi g 



6- a » 



x g yv g yo 



holds, ensuring that Y> g is positive-definite. 

To update u g and X g we maximize the function 



log (K x (u)) + (A - l)C g ~-(A g + B g ) 



using a general optimization routine via t he optim package for R. 

The Aitken acceleration ( jAitkenl . Il926h can used to estimate the asymptotic maximum of 
the log-likelihood at each iteration of an EM algorithm and thereby to determine convergence. 
The Aitken acceleration at iteration k is 

_ - jW 



l(k)_i(k-i)> 

where is the log-likelihood at iteration k. An asymptotic estimate of the log-likelihood 
at iteration k + 1 is 

i (fc+i) = z (*) + ^J_ (z (fc+i)_ z (fc) )j 



and the algori t hm c an be considered to have converged when — < e ( Bohning et al 



1994 ; iLindsayi . Il995[ ). This criterion is used for the analyses in Section HI with e = 10" 



In many practical applications, the number of mixture components G is unknown. In our 
illustrative dat a analyses (Sec tion H]), G is treated as unknown and the Bayesian information 
criterion (BIC; ISchwarzl . 1 1 9 78h is used to select it. The BIC can be written as BIC = 2/(x, 6) — 
plogn, where Z(x, 0) is the maximized log-likelihood, 6 is the maximum likelihood estimate 
of 0, p is the number of free parameters in the model, and n is the number of observations. 
The use of the BIC for mixture model selection has been motivated through Baye s factors 
(IKass and Rafteryl . Il995l ; iKass and Wassermanl . Il995l : iDasgupta and Raftervl . fl998l ) and has 
become popular due to its widespread use within the Gaussian mixture modelling literature. 
While many alternatives have been proffered, none have proved superior. 
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Table 1: Mean parameter estimates from the application of our mixture of generalized hy- 
perbolic distributions to 100 simulated data sets from a two-component mixture of Gaussian 
distributions. 





9 = 1 




9 = 


2 


True 


Estimated 


True 


Estimated 




(3.00,3.00) 


(2.91,2.97) 


(-3.00,-3.00) 


(-2.74,-3.14) 


OLg 


(0.00,0.00) 


(0.09,0.03) 


(0.00,0.00) 


(-0.26,0.14) 




(1.00,-0.75,1.00) (0.96, 


-0.72,0.98) 


(1.00,-0.75,1.00) 


(0.98,-0.74,0.99) 


UJg 


0.00 


0.00 


0.00 


0.00 




— > — oo 


-96.38 


— > — oo 


-94.98 



4 Data analyses 



4.1 Overview 

The mixture of generalized hyperbolic distributions model is illustrated on simulated and 
real data. We consider cluster analyses, but these mixture models could equally well be 
applied for semi-supervised classification, discriminant analysis, or density estimation. In 
each of our clustering analyses, the true classifications are known but treated as unknown for 
illustration. While this sort of synthetic clustering example may not be considered quite akin 
to real clustering, it is representative of what has become the norm with the model-based 
clustering literature. Furthermore, the real data sets that we use are selected because of 
their popularity as benchmark data sets within the aforesaid literature. 

Because we know the true group memberships, we can assess the performance of these 
mixture models in terms of cl a ssification accuracy , which we measure using the adjusted 
Rand index (ARI; iRandl . Il971t iHubert and Arabid . Il985[ ). The ARI has expected value 
under random classification and takes the value 1 for perfect class agreement. Negative values 
of the ARI indicate classification worse than would be expected under random classification. 



4.2 Simulated data analyses 

To illustrate the flexibility of our mixture of generalized hyperbolic distributions model and 
the efficacy of EM algorithm model fitting, they were applied to data simulated from a 
mixture of Gaussian distributions and a mixture of skew-t distributions. In each case, 100 
two-component data sets were simulated with ri\ = ri2 = 250 and the models were fitted 
within the model-based clustering paradigm for G — 1, ... ,5. In all cases, a G = 2 com- 
ponent model was selected, perfect classification results were obtained, and the parameter 
estimates are close to the true values (Tables [1] and [2]). 
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Table 2: Mean parameter estimates from the application of our mixture of generalized 
hyperbolic distributions to 100 simulated data sets from a two-component mixture of skew-t 
distributions. 

9=1 g=2 



True Estimated True Estimated 
~jjtg (3.00,3.00) (2.95,3.04) (-3.00,-3.00) (-2.89,-3.11) 
otg (2.00,-2.00) (2.05,-2.06) (-1.00,1.00) (-1.30,1.36) 
S 5 (1.00,-0.75,1.00) (0.99,-0.74,0.98) (1.00,-0.75,1.00) (1.01,-0.76,1.01) 
u g 0.00 0.00 0.00 0.00 
A3 -4.00 -4.12 -10.00 -10.51 



Table 3: Classifications for the chosen mixture of generalized hyperbolic distributions and 
Gaussian mixture model for the crabs data. 









Gen. Hyperbolic 




Gaussian 






1 


2 


3 


4 


1 2 


Blue 


Male 


39 


11 






21 29 


Female 




50 






26 24 


Orange 


Male 






50 




24 26 


Female 






4 


46 


9 41 



4.3 Real data analyses 



4.3.1 Leptograpus crabs data 



Campbell and Mahonl ( 11974] ) reported data on five biological measurements of 200 crabs 
from the genus leptograpus. The data were collected in Fremantle, Western Australia and 
comprise 50 male and 50 female crabs for each of two species: orange and blue. The data 
were s ourced from the MASS librar y for R which conta i ns da ta sets from IVenables and Ripley 
( 119991 ) . These data were used by iRaftery and Dean! (120061 ) to illustrate the performance of 
a variable selection technique for model-based clustering. 

Mixtures of generalized hyperbolic distributions were fitted to these data for G = 1, . . . , 10 
The model chosen by the BIC had G = 4 components and the resulting MAP classifications 
gave ARI=0.82. For a Gaussian mixture model fitted over the same range of G, the BIC 
chose a G = 2 component model that classification performance akin to guessing (ARI=0.03; 
cf. Table [3]). The performance of our mixture of generalized hyperbolic distributions on these 
data compares favourably with other analyses throu ghout the literature. For example, the 
famous MCLUST models ( jFraley and Rafteryl . 120021 ) select a G = 10 component model with 
ARI=0.46. 
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Table 4: Classifications for the chosen mixture of generalized hyperbolic distributions and 
Gaussian mixture model for the Italian wine data. 







Gen. Hyperbolic 




Gaussian 


1 


2 3 


1 


2 


Barolo 


58 


1 


59 




Grignolino 


1 


70 


3 


68 


Barbera 




1 47 




48 



4.3.2 Italian wine data 



Forina et al.l ( 119861 ) reported chemical and physical measurements on three varieties (Barolo, 
Grignolino, Barbera) of wine from the Piedmont region of It aly. There are 178 samples and 
thirteen measurements available within the gclus package (iHurleyl . 120041 ) for R. Mixtures 
of generalized hyperbolic distributions were fitted to these data for G = 1, . . . , 10. The 
BIG selected a G = 3 component model with ARI=0.95. Mixtures of Gaussian distribu- 
tions were fitted over the same range of G and the BIC selected a G = 2 component model 
with ARI=0.55. Again, the classification performance of our mixture of generalized hyper- 
bolic distributions is favourable compared to the state-of-the-art. To illustrate this point, 
MCLUST selects G = 10 component model with ARI=0.48. 



5 Discussion 

A mixture of generalized hyperbolic distributions has been introduced. Parameter estima- 
tion, via an EM algorithm, was enabled by exploitation of the relationship with the GIG 
distribution. The mixture models were illustrated in two real clustering applications where 
they outperformed Gaussian mixture models and performed favourably when considered in 
the context of the wider literature. Although illustrated for clustering, mixtures of general- 
ized hyperbolic distributions can also be applied for semi-supervised classification, discrimi- 
nant analysis, and density estimation. They represent perhaps the most flexible in a series 
of alternatives to the Guassian mixture models for clustering and classification. What sets 
the mixture of generalized hyperbolic distributions apart from other alternatives is its role 
as a superclass containing the others as special or limiting cases (cf. Section [272]) . 

The distinguishing parameters of the generalized hyperbolic distributions are the con- 
centration parameter u and the shape parameter A. These two parameters arise from the 
latent GIG variable. The concentration parameter is similar to the degrees of freedom in 
the t-distribution in that it allows the shifting of density from the tails of the distribution to 
the central mode, which is the mean if the skewness is zero. Multivariate distributions with- 
out a concentration parameter, such as the the shifted asymmetric Laplace distribution and 
the Gaussian distribution, can try to shift density in this way using their scale parameter; 
however, this approach can often fail due to outliers. 
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The precise impact of a superclass on what may become general practice is currently 
open for debate. Certainly, we do not suggest that one should use our mixtures of general- 
ized hyperbolic approaches exclusively, completely ignoring more well-established approaches 
such as Gaussian mixtures. However, results obtained to date suggest that application of 
mixtures of generalized hyperbolic in real cluster analyses can outperform its special cases 
and this should not be ignored. Future work will focus on the introduction of parsimony 
to mixtures of generalized hyperbolic distributions and a detailed study comparing the re- 
sulting families to their special-case counterparts. Parsimony can be achieved by imposing 
constraints on the Si, . . . , S^, which is a somewhat natural approach because, for all but 
very small p, most of the model parameters are there. These constraints could be in the form 
of an eigen-decomposition, as used by M CLUST. There is also the possibility of considering a 



generalized hyperbolic analogue o f the mixture of factor analyzers (IGh ahra mani and Hinton 



19971 : iMcLachlan and Peel l2000h or mixture of common factors (IBaek et all |2010| ). 

Before we could consider mixtures, an ident inability issue around generalized hyperbolic 
distributions needed to be overcome in a fashion that would not be prohibitive for clustering 
and classification. The re-parameterization that we used in Section 12.31 allowed us to relax 
the restriction that |S| = 1 and thereby obtain meaningful clustering results. However, if 
one wanted to look beyond clustering or classification results to interpreting the estimated 
value of the skewness parameter a., another parameterization would be required. To see why, 
consider X ^ QZ{\ n, S, (3) and note that E[X] = /x + a; the consequences are apparent 



by comparison of fi g 
of a, again set u = v 



OL g and fi g + ct g in Tables [T] and [2j To allow proper interpretation 
'xip but now fix 



E [W] 



ip K x (s/x$) 

which allows proper interpretation of the skewness parameter a while also forcing 



K x+1 (Vg) 



and 



X 



Kx(Vx^) 
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